

Introduction

Ajtai, Komlós, and Szemerédi [1] recently proposed a sorting network
(referred to hereafter as the AKS network) of $O(n \log n)$ comparators and
$O(\log n)$ depth. Their construction is of great theoretical interest, for
it shows that $O(n \log n)$ comparisons suffice to sort $n$ elements, even under
the constraint that comparisons be nonadaptively executed in $O(\log n)$ parallel
stages. At present, the AKS network does not appear suitable for practical
implementation, owing to the large value of the constants; however,
improvements are conceivable that could make the network more attractive for
real-world applications.

It  is  therefore  natural  to  ask  what  is  the  performance  of  the  AKS  network 
in  the  synchronous  VLSI  model  of  computation  which  has  been  proposed  [2]  to 
capture  the  essential  features  of  planar  very  large  scale  integration  as  a 
computing  environment. 

In this model it is known that any chip capable of sorting $n$ words of
length $q = (1+\rho)\log n$, with $\rho > 0$, must satisfy the relationship
$AT^2 = \Omega(n^2 \log^2 n)$, where $A$ is the chip area and $T$ is the
computation time. This lower bound was originally obtained by Thompson [2]
under the word-local restriction (all the bits of the same word enter the
circuit at the same input port). Recently Leighton [3] has shown that the
lower bound remains valid even for non-word-local designs.


This  work  has  been  supported  in  part  by  the  Joint  Services  Electronics  Program 
under  Contract  N00014-79-C-0424  and  by  the  IBM  Predoctoral  Fellowship  Program. 


Many designs of VLSI sorters have already been proposed (see Thompson
[4] for a survey). We mention here the ones that achieve minimum area
$A = \Theta(n^2 \log^2 n / T^2)$ at their computation time $T$:

- the mesh-connected [1,5,6] bitonic sorter [7], for $T = O(\sqrt{n})$.

- the pleated cube-connected-cycles (PCCC) [8], also implementing
bitonic sorting, for $T$ in the range $[\Omega(\log^2 n), O(\sqrt{n \log n})]$.

- a hybrid architecture based on the cube-connected-cycles and the
orthogonal trees interconnections [9], which implements the
enumeration sorting schemes of [10], and works in minimum computation
time $T = O(\log n)$.

- a hybrid architecture consisting of orthogonal trees and permuter
networks [3], which implements a generalization of the even-odd sort
[7], and also works in time $T = O(\log n)$.

It is then interesting to see how the AKS algorithm, which is radically
different from any other known sorting paradigm, compares with more classical
sorting methods in the VLSI environment, where the heaviest demand on resources
usually comes from communication rather than from computing requirements, so
that a small number of processing elements does not necessarily imply good
performance.

In this note we show that the AKS sorting network can indeed be laid out
in area $A = O(n^2)$, while maintaining an $O(\log n)$ computation time, thereby
establishing its optimality in the VLSI model of computation.
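The optimality claim can be checked in one line by combining the two upper bounds just stated with the lower bound quoted earlier. The display below is only a restatement of those bounds, with no new assumption beyond $q = O(\log n)$:

```latex
\[
  A\,T^{2} \;=\; O(n^{2}) \cdot O(\log^{2} n) \;=\; O(n^{2}\log^{2} n),
  \qquad\text{whereas any sorter must satisfy}\qquad
  A\,T^{2} \;=\; \Omega(n^{2}\log^{2} n),
\]
\[
  \text{hence}\qquad A\,T^{2} \;=\; \Theta(n^{2}\log^{2} n).
\]
```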


Layout  of  the  AKS  Network 


The original description [1] of the AKS network (with $n$ inputs) is given
in terms of an $n$-node graph $G = (V,E)$, whose nodes are registers and whose
edges are comparators. The set of edges $E$ is partitioned as
$E = E_1 \cup E_2 \cup \dots \cup E_N$, where each of the $E_s$'s is a (possibly
partial) matching on $V$, and $N < g \log n$ for some (very large) constant $g$.
Since each $E_s$ ($s = 1,\dots,N$) is a (possibly partial) matching, all of its
comparators can be simultaneously active. Thus the AKS sorting algorithm can
be described as follows:

begin for s := 1 to N do
        for all (x,y) ∈ E_s, and x < y pardo
          (R(x),R(y)) := (min(R(x),R(y)), max(R(x),R(y)))
end

where R(x) is the content of the register associated with node x.
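The staged procedure above can be simulated sequentially. The following is a minimal sketch, assuming registers are indexed from 0 and each stage is given as a list of pairs; the names `run_network` and `DEMO_STAGES` are illustrative, and the 4-input demo network is a small textbook sorting network, not the AKS network.

```python
from typing import List, Sequence, Tuple

Matching = Sequence[Tuple[int, int]]  # pairs (x, y) of register indices, x < y


def run_network(values: Sequence[int], stages: Sequence[Matching]) -> List[int]:
    """Apply the matchings E_1, ..., E_N in order to a copy of `values`."""
    R = list(values)
    for stage in stages:
        touched = set()
        for x, y in stage:
            # A matching uses each register at most once, so the pairs of one
            # stage are independent and could all fire in parallel ("pardo").
            assert x < y and x not in touched and y not in touched
            touched.update((x, y))
            if R[x] > R[y]:
                R[x], R[y] = R[y], R[x]  # (R(x), R(y)) := (min, max)
    return R


# Illustrative 4-input sorting network (NOT the AKS network): three matchings.
DEMO_STAGES = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
```

For instance, `run_network([3, 2, 1, 0], DEMO_STAGES)` passes the input through the three matchings and returns the values in ascending order.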

Since the embedding of a graph in a planar grid requires nodes of bounded
degree, we shall modify the original description as follows. According to a
scheme described by Knuth [11], we consider $n$ lines that run parallel, say,
to the horizontal axis. On line $r$ ($r = 1,2,\dots,n$) there will be $N$ processors
P[r,1],...,P[r,N], whose capability will be specified below. For each
$s = 1,2,\dots,N$, and for each $(x,y) \in E_s$, we connect processors P[x,s] and
P[y,s] by a vertical line. Such a vertical line supports the execution of the
comparison-exchange (R(x),R(y)) := (min(R(x),R(y)), max(R(x),R(y))), where
R(x) and R(y) are respectively the operands stored in P[x,s] and P[y,s].

Once the comparison-exchanges specified by $E_s$ have been executed, the
results will be forwarded on each line (that is, from P[x,s] to P[x,s+1],

This basic layout can be further specified by selecting the degree of
parallelism of the operand transmission. Because the scheme is amenable to
pipelined operation, the $q$-bit operands are fed in bit-serial fashion,
starting with the most significant bit, and each processor is equipped with
a serial comparator. In each comparator, as long as the two inputs agree,
they are transmitted to the next processor on the same line. As soon as a
bit discrepancy is detected, a switch is set and, from then on, the remaining
substrings of each of the operands will follow a fixed path independently
of their value.
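The behaviour of one serial comparator can be sketched as follows. This is an illustrative model only: the function name `serial_compare_exchange`, the three-state `switch` variable, and the representation of operands as lists of bits (most significant first) are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def serial_compare_exchange(a_bits: List[int],
                            b_bits: List[int]) -> Tuple[List[int], List[int]]:
    """Bit-serial compare-exchange, most significant bit first.

    Returns (min_bits, max_bits). Bits are emitted one per step, so the
    output of this stage can feed the next stage without waiting for the
    whole operand.
    """
    assert len(a_bits) == len(b_bits)
    lo: List[int] = []
    hi: List[int] = []
    switch: Optional[bool] = None  # None: undecided, False: straight, True: crossed
    for a, b in zip(a_bits, b_bits):
        if switch is None and a != b:
            # First discrepancy: the operand showing a 1 here is the larger.
            switch = a > b
        if switch:
            lo.append(b)
            hi.append(a)
        else:
            # Undecided (bits equal so far) or straight path.
            lo.append(a)
            hi.append(b)
    return lo, hi
```

Once `switch` is set, the remaining bits are routed without further comparison, which is what allows consecutive stages to overlap in time.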

Thus we have ensured that the AKS network works in $T = O(\log n + q) = O(\log n)$
time, and we turn our attention to the layout area. We first observe that
both the horizontal and the vertical lines are of $O(1)$ width. It is then
simple to conclude that the height of the entire layout is $O(n)$. On the
other hand, any matching of $n$ lines can be easily laid out in (at most) $n/2$
vertical tracks of constant width, by using a track for each edge of the
matching. Since there are $N = O(\log n)$ matchings to be cascaded in the AKS
network, it is readily proved that $O(n \log n)$ width, and therefore $O(n^2 \log n)$
area, suffices for the layout. A closer analysis, however, reveals that many
of the matchings $E_1,\dots,E_N$ are such that many edges can be laid out, without
overlap, in the same vertical track, yielding the conclusion that the bound
for the area can be lowered to $O(n^2)$.
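To illustrate how several edges of one matching can share a vertical track, the sketch below packs edges into tracks with greedy interval partitioning. This is a standard technique used here purely for illustration, not the construction analyzed in this note; the function name `assign_tracks` and the 0-based line indices are assumptions.

```python
import heapq
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (x, y) with x < y: a wire spanning lines x..y


def assign_tracks(edges: List[Edge]) -> Dict[Edge, int]:
    """Assign each edge a track so that edges in one track never overlap."""
    busy: List[Tuple[int, int]] = []  # min-heap of (last occupied line, track id)
    track_of: Dict[Edge, int] = {}
    next_track = 0
    for x, y in sorted(edges):
        if busy and busy[0][0] < x:
            # The track that frees up earliest ends above line x: reuse it.
            _, track = heapq.heappop(busy)
        else:
            track = next_track
            next_track += 1
        track_of[(x, y)] = track
        heapq.heappush(busy, (y, track))
    return track_of
```

With the edges `[(0, 2), (1, 3), (4, 6), (5, 7)]`, which live in two disjoint groups of lines, two tracks suffice instead of four: edges from different groups stack in the same track.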

To establish this claim we introduce the following description of the
layout of the AKS network. The layout can be analyzed as the assembly
of suitable simpler building blocks, whose hierarchy is illustrated in Figure 1.
Each of these building blocks will now be described in detail, in a top-down
fashion.


[Figure 1: diagram not reproduced. Surviving labels: AKS network; depths $1 + 3\log n$, $1$, $\log(1/\eta)$.]

Figure 1. Hierarchy of building blocks of the AKS network. The depth is
expressed as the length of the cascade of blocks of the
immediately lower level.


(1) The AKS network on $n = 2^d$ inputs is the cascade of $(1+3d)$ stages,
called cherry stages, and denoted by $S_0, S_{11}, S_{12}, S_{13}, \dots, S_{d1}, S_{d2}, S_{d3}$
(Figure 2).


(2) To each cherry stage $S_{t,h}$ ($t = 1,\dots,d$; $h = 1,2,3$) there corresponds a
partition $P_{t,h}$ of the integers (lines) $1,2,\dots,n$. Although the assignment
of the integers to the partition blocks is too complicated to be repeated
here (the reader is referred to [1]), what is important now are the
properties of $P_{t,h}$ that are relevant to the layout. Specifically, $P_{t,h}$
consists of the following (disjoint) blocks:

\[
P_{t1} = P_{t3} = \{\, T_t(2i,j) : i = 0,1,\dots,\lfloor (t-1)/2 \rfloor;\ j = 1,2,\dots,2^{2i} \,\}
\]
\[
P_{t2} = \{\, T_t(2i-1,j) : i = 1,2,\dots,\lfloor t/2 \rfloor;\ j = 1,2,\dots,2^{2i-1} \,\} \cup \{\, T_t(-1,0) \,\}
\]

To stage $S_0$ there corresponds the trivial partition $P_0$ consisting of one
block only.


If we now define as span(T) the smallest interval of $\{1,\dots,n\}$ containing
$T \subseteq \{1,\dots,n\}$, we have the following properties:

(1) For given $t$ and $i$, and $j' \ne j$, $\mathrm{span}(T_t(i,j)) \cap \mathrm{span}(T_t(i,j')) = \emptyset$.

(2) $|\mathrm{span}(T_t(i,j))| \le n/2^i$ for every $t$ and $j$.

(3) $|T_t(i,j)| < \gamma\,(n/2^i)\,A^{i-t}$ for every $j$, where $\gamma$ and $A = 2a > 1$ are
constants.


The lines numbered by the integers in a block $T_t(i,j)$ are involved in a
network of comparators called an η-nearsorter (see Figure 3). Properties (1) and
(2) show that for any fixed $t$ and $i$, all η-nearsorters corresponding to
$\{T_t(i,j) : j = 1,2,\dots,2^i\}$ can be laid out in the same vertical strip,
as shown in Figure 3. Moreover, all nearsorters in the same cherry stage
can operate in parallel (indeed, no two share a line).


[Figure 3: diagram not reproduced.]

Figure 3. Typical cherry stages ($t$ is even in the figure).
The regions labelled $T_t(i,j)$ correspond to the layout of an
η-nearsorter.


(3) An η-NEARSORTER, corresponding to block $T_t(i,j)$, has the structure
of a full binary tree of depth $\log(1/\eta)$. Each node of this tree is a
network of comparators, called an ε-HALVER (see (4)), encompassing an
interval of lines (Figure 4). If $m = |T_t(i,j)|$, then the root encompasses
$m$ lines; if a node $v$ of the tree encompasses $s$ lines, then
its two offspring each encompass (approximately) $s/2$ lines.


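[Figure 4: diagram not reproduced. Surviving labels: the $m$ input lines of $T_t(i,j)$; an ε-HALVER on $m$ inputs at the root; ε-HALVERS on $m/2$ inputs at the next level; ε-HALVERS on $\eta m$ inputs at the last level.]

Figure 4. An η-NEARSORTER is a full binary tree of ε-HALVERS.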


(4) An ε-HALVER stage on $m$ lines (with $\varepsilon < \eta/\log(1/\eta)$) consists of the
cascade of $c$ (where $c$ is a function of ε, but is independent of $m$)
one-factor stages (matching stages). (When the network is viewed as
a graph $G = (V,E)$, i.e. when each line is shrunk to a single node, the
ε-HALVER becomes an expander graph on the set of nodes on which its
edges are incident.) (See Figure 5.)


Figure 5. An ε-HALVER is a cascade of a constant number of one-factors.

(5) Finally, a one-factor stage on $m$ lines is a matching between the lower
and the upper half of these lines, and it is a subset of exactly one
of the sets $\{E_s : s = 1,\dots,N\}$ introduced earlier. (See Figure 6.)


Figure  6.  A  one-factor  is  a  matching  between  the  top  and  the  bottom  half 
of  lines. 
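The tree structure of item (3) can be made concrete by listing, level by level, the line intervals that the halvers cover. The sketch below is illustrative only: it assumes the $m$ lines of a block are renumbered $0,\dots,m-1$, uses half-open intervals, and splits each interval at its midpoint; the name `halver_intervals` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open range [lo, hi) of line positions


def halver_intervals(lo: int, hi: int, depth: int) -> List[List[Interval]]:
    """Return the intervals covered by the halvers, one list per tree level.

    Level 0 is the root halver on all lines; each further level splits
    every interval of the previous level into two (approximately) equal halves.
    """
    levels: List[List[Interval]] = [[(lo, hi)]]
    for _ in range(depth):
        nxt: List[Interval] = []
        for a, b in levels[-1]:
            mid = (a + b) // 2
            nxt.extend([(a, mid), (mid, b)])
        levels.append(nxt)
    return levels
```

Because the intervals within one level are pairwise disjoint, the halvers of that level can be stacked in the same vertical strip of the layout.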


Now  we  proceed,  bottom-up,  to  analyze  the  area  of  the  network. 

(i) A one-factor stage on $m$ lines can be laid out in $O(m)$ length, by
allocating a vertical track for each of the $m/2$ edges. The height
of the layout will be proportional to the distance between the
topmost and the bottommost of the input lines.

(ii) An ε-HALVER has a length of $O(cm)$; $c$ is the valence of the ε-HALVER.

(iii) An η-NEARSORTER also has a length of $O(cm)$, since the length
of the ε-HALVERS decreases geometrically with the level.

(iv) We now subdivide the layout into vertical slabs, with slab(t,i)
containing the nearsorters on sets $T_t(i,j)$ for all suitable values
of $j$. (There are in fact two identical copies of $T_t(i,j)$ when $i$ is
even, but this will only affect constant factors.) From point (iii)
and property (3) it immediately follows that, with the constant factor
of point (iii) absorbed into $\gamma$,

\[
\ell(t,i) = \text{length of slab}(t,i) \le \gamma\, n\, 2^{-i} A^{i-t} .
\]

Then, the total length $\ell$ can be obtained by summing $\ell(t,i)$ over all the
vertical slabs:

\[
\ell = \sum_{t=0}^{d} \sum_{i=0}^{t} \ell(t,i)
     = \sum_{i=0}^{d} \sum_{t=i}^{d} \ell(t,i)
     \le \gamma\, n \sum_{i=0}^{d} 2^{-i} \sum_{t=i}^{d} (1/A)^{t-i}
     < \frac{2\gamma}{1 - (1/A)}\, n .
\]

In conclusion, $A = \text{height} \times \text{length} = O(n) \times O(n) = O(n^2)$, as claimed.
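The two geometric sums behind points (iii) and (iv) can be checked numerically. The sketch below assumes the slab bound $\gamma\, n\, 2^{-i} A^{i-t}$ obtained from point (iii) and property (3), and counts length in vertical tracks; the function names and the sample values of $\gamma$, $A$, and $c$ are illustrative and are not the constants of [1].

```python
def nearsorter_length(m: int, c: int, depth: int) -> float:
    """Tracks used by a nearsorter on m lines: level k costs c * (m / 2**k) / 2."""
    # All halvers of one level share a strip, so each level is counted once.
    return sum(c * (m / 2 ** k) / 2 for k in range(depth + 1))


def total_length(n: int, d: int, gamma: float, A: float) -> float:
    """Sum of the slab bounds gamma * n * 2**(-i) * A**(i - t) over 0 <= i <= t <= d."""
    total = 0.0
    for t in range(d + 1):
        for i in range(t + 1):
            total += gamma * n * 2.0 ** (-i) * A ** (i - t)
    return total


def length_bound(n: int, gamma: float, A: float) -> float:
    """Closed-form bound 2 * gamma * n / (1 - 1/A), valid for A > 1."""
    return 2 * gamma * n / (1 - 1 / A)
```

For any choice with `A > 1`, `total_length(n, d, gamma, A)` stays below `length_bound(n, gamma, A)` regardless of `d`, which is the $O(n)$ length used in the area bound; similarly `nearsorter_length(m, c, depth)` stays below `c * m`.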